8  Linear and multiple regression

Author

Vladimir Buskin

The following introduction is based on Heumann et al. (2022: Chapter 11), James et al. (2021: Chapter 3), Levshina (2015: Chapter 7) and Winter (2020: Chapter 4).

8.1 Preparation

# Load libraries

library("readxl")
library("writexl")
library("tidyverse")

# Load file
ELP <- read_xlsx("ELP.xlsx")

# Inspect file structure
str(ELP)

8.2 Introduction

Consider the distribution of the continuous variable RT (reaction times) from the ELP (English Lexicon Project) dataset. We will \(log\)-transform the reaction times to even out the differences between extremely high and extremely low values (cf. Winter 2020: 90-94).

(Histograms of the raw and the log-transformed reaction times.)

We are investigating the relationship between reaction times RT and the frequency Freq of a lexical stimulus.

Some open questions:

  • Can word frequency help us explain variation in reaction times?

  • If it can, then how could we characterise the effect of word frequency? In other words, does it increase or decrease reaction times?

  • What reaction times should we expect for new observations?

8.2.1 Building a statistical model

  • Here RT is a response or target that we wish to explain. We generically refer to the response as \(Y\).

  • Freq is a feature, input, or predictor, which we name \(X\).

We can thus summarise our preliminary and fairly general statistical model as

\[ Y = f(X) + \epsilon. \] While the term \(f(X)\) denotes the contribution of \(X\) to the explanation of \(Y\), \(\epsilon\) describes the errors of the model.

8.3 Linear Regression

  • Linear regression is a simple approach to supervised machine learning where the response variable is known. It assumes that the dependence of \(Y\) on \(X\) is linear.

  • This approach is suited for numerical response variables. The predictors can be either numerical or discrete/categorical.

  • Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

8.3.1 Model with a single predictor \(X\)

  • A linear model of our data would have the form

\[ Y = \beta_0 + \beta_1X + \epsilon \]

or, in more concrete terms,

\[ \text{Reaction time} = \beta_0 + \beta_1\text{Frequency} + \text{Model Error,} \]

where \(\beta_0\) and \(\beta_1\) are two unknown constants that represent the intercept and slope, respectively. Together they are referred to as the model coefficients (or parameters), and \(\epsilon\) is the error term. The fact that assumptions are made about the form of the model renders it a parametric model.

  • Given the existing data on \(X\) and \(Y\), which is also known as the training data, we estimate the model coefficients \(\beta_0\) and \(\beta_1\). While we use the training data to estimate these coefficients, we cannot know the true values of the coefficients, i.e., the exact true relationship between the variables.

  • The most common way of estimating parameters for linear models is the least squares approach. In essence, the parameters are chosen such that the residual sum of squares, i.e., the sum of the squared differences between observed and predicted values, is as low as possible.

  • We can then predict future reaction times using the formula

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1x, \]

where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of the predictor values \(X = x\). The hat symbol ^ marks estimated values.

  • In R, we can fit a linear model with the lm() function.
# Fit linear model

rt.lm1 <- lm(log(RT) ~ log(Freq), data = ELP)

# View model data

summary(rt.lm1)

Call:
lm(formula = log(RT) ~ log(Freq), data = ELP)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.29765 -0.08203 -0.01205  0.07298  0.43407 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.633361   0.004286 1547.82   <2e-16 ***
log(Freq)   -0.048602   0.002201  -22.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1235 on 878 degrees of freedom
Multiple R-squared:  0.357, Adjusted R-squared:  0.3563 
F-statistic: 487.5 on 1 and 878 DF,  p-value: < 2.2e-16
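
The least-squares estimates reported above can also be reproduced by hand. The following is a minimal sketch, assuming the ELP data and rt.lm1 from above are loaded: for a single predictor, the slope estimate is the covariance of predictor and response divided by the variance of the predictor, and the intercept follows from the sample means.

```r
# Closed-form least-squares estimates for the simple model
x <- log(ELP$Freq)
y <- log(ELP$RT)

b1 <- cov(x, y) / var(x)      # slope estimate
b0 <- mean(y) - b1 * mean(x)  # intercept estimate

c(b0, b1)  # should match coef(rt.lm1)
```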

The model statistics comprise the following elements:

  1. Call, i.e., the model formula.

  2. Residuals: These indicate the difference between the observed values in the data set and the values predicted by the model (= the fitted values). These correspond to the error term \(\epsilon\). The lower the residuals, the better the model describes the data.

# Show fitted values (= predictions) for the first six observations

head(rt.lm1$fitted.values)
       1        2        3        4        5        6 
6.635345 6.563152 6.789806 6.613980 6.630529 6.574894 
# Show deviation of the fitted values from the observed values

head(rt.lm1$residuals)
          1           2           3           4           5           6 
 0.03778825 -0.02277165  0.07759556  0.03387713  0.15222949 -0.10432670 
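As a quick sanity check (again assuming rt.lm1 from above), the fitted values and the residuals add up exactly to the observed values:

```r
# Observed log reaction times decompose into fitted values + residuals
head(log(ELP$RT))
head(rt.lm1$fitted.values + rt.lm1$residuals)  # same values
```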
  3. Coefficients: The regression coefficients correspond to \(\hat{\beta}_0\) (“Intercept”) and \(\hat{\beta}_1\) (“log(Freq)”), respectively. The model shows that for a one-unit increase in log-frequency, the log-reaction time decreases by approx. 0.05.
# Convert coefficients to a tibble 

library("broom")

tidy_model <- tidy(rt.lm1)

tidy_model
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   6.63     0.00429    1548.  0       
2 log(Freq)    -0.0486   0.00220     -22.1 2.88e-86
  4. \(p\)-values and \(t\)-statistic: Given the null hypothesis \(H_0\) that there is no correlation between log(RT) and log(Freq) (i.e., \(H_0: \beta_1 = 0\)), a \(p\)-value lower than 0.05 indicates that \(\beta_1\) considerably deviates from 0, thus providing evidence for the alternative hypothesis \(H_1: \beta_1 \ne 0\). Since \(p < 0.001\), we can reject \(H_0\).

    The \(p\)-value itself crucially depends on the \(t\)-statistic, which measures “the number of standard deviations that \(\hat{\beta}_1\) is away from 0” (James et al. 2021: 67). The standard error (SE) reflects how much an estimated coefficient differs on average from the true value of the corresponding parameter. It can be used to compute the 95% confidence interval \([\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)]\); the true value of the parameter \(\beta_1\) lies within the specified range 95% of the time.

# Compute confidence intervals for intercept and log(Freq)

tidy_model_ci <- tidy(rt.lm1, conf.int = TRUE)

tidy_model_ci
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   6.63     0.00429    1548.  0          6.62      6.64  
2 log(Freq)    -0.0486   0.00220     -22.1 2.88e-86  -0.0529   -0.0443

The estimated parameter for log(Freq), which is -0.049, thus has the 95% confidence interval [-0.053, -0.044].
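
The interval reported by tidy() can be recomputed manually from the estimate and its standard error; a sketch, assuming rt.lm1 from above:

```r
# Manual 95% confidence interval for the log(Freq) slope
est <- coef(summary(rt.lm1))["log(Freq)", "Estimate"]
se  <- coef(summary(rt.lm1))["log(Freq)", "Std. Error"]

est + c(-1, 1) * qt(0.975, df.residual(rt.lm1)) * se
# compare: confint(rt.lm1)["log(Freq)", ]
```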

  5. The residual standard error (RSE) is an estimate of the standard deviation of the error term \(\epsilon\), i.e., the average amount by which the observed values deviate from the regression line.

  6. \(R^2\) is important for assessing model fit because it “measures the proportion of variability in \(Y\) that can be explained using \(X\)” (James et al. 2021: 70; emphasis removed), varying between 0 and 1.

  7. The \(F\)-statistic is used to measure the association between the dependent variable and the independent variable(s). Generally speaking, values greater than 1 indicate a possible correlation. A very low \(p\)-value suggests that the null hypothesis \(H_0: \beta_1 = 0\) can be rejected.
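
Both the RSE and \(R^2\) from the summary can be recomputed from the residuals; a sketch, assuming rt.lm1 and the ELP data from above:

```r
y   <- log(ELP$RT)
rss <- sum(residuals(rt.lm1)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares

1 - rss / tss                    # R-squared (approx. 0.357)
sqrt(rss / df.residual(rt.lm1))  # residual standard error (approx. 0.124)
```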

8.3.2 Multiple linear regression

In multiple linear regression, more than one predictor variable is taken into account. For instance, modelling log(RT) as a function of log(Freq), POS and Length requires a more complex model of the form

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon.\]

Predictions are then obtained via the formula

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + ... + \hat{\beta}_px_p. \]

In R, a multiple regression model is fitted as in the code example below:

# Fit multiple regression model

rt.lm2 <- lm(log(RT) ~ log(Freq) + POS + Length, data = ELP)

# View model statistics

summary(rt.lm2)

Call:
lm(formula = log(RT) ~ log(Freq) + POS + Length, data = ELP)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26955 -0.07853 -0.00672  0.07067  0.39528 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.459742   0.016946 381.205  < 2e-16 ***
log(Freq)   -0.038071   0.002130 -17.874  < 2e-16 ***
POSNN       -0.006242   0.010157  -0.615  0.53902    
POSVB       -0.035234   0.012125  -2.906  0.00375 ** 
Length       0.023094   0.001711  13.495  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1114 on 875 degrees of freedom
Multiple R-squared:  0.4784,    Adjusted R-squared:  0.476 
F-statistic: 200.6 on 4 and 875 DF,  p-value: < 2.2e-16
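
The fitted model can then be used to obtain predictions for new observations via predict(). The stimulus below is invented purely for illustration; its POS level “NN” is one of the levels visible in the summary above.

```r
# Hypothetical new stimulus (values made up for illustration)
new_word <- data.frame(Freq = 1000, POS = "NN", Length = 5)

predict(rt.lm2, newdata = new_word)                          # predicted log(RT)
predict(rt.lm2, newdata = new_word, interval = "confidence") # with 95% CI
```

Since the model predicts log-reaction times, exp() converts a prediction back to the original scale.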

8.4 Visualising regression models

Plot coefficient estimates:

# Tidy the model output
tidy_model <- tidy(rt.lm2, conf.int = TRUE)

# Remove intercept
tidy_model <- tidy_model %>% filter(term != "(Intercept)")

# Create the coefficient plot
ggplot(tidy_model, aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "steelblue") +
  theme_minimal() +
  labs(
    x = "Coefficient Estimate",
    y = "Predictor",
    title = "Coefficient Estimates with Confidence Intervals"
  )

Plot contributions of individual variable values:

# Plot marginal effects

library("effects")
plot(Effect("Freq", mod = rt.lm2))

plot(Effect("POS", mod = rt.lm2))

plot(Effect("Length", mod = rt.lm2))

8.5 Model assumptions and diagnostics

As a parametric method, linear regression makes numerous assumptions about the training data. It is, therefore, essential to run further tests to rule out possible violations. Among other things, the model assumptions include:

  • A linear relationship between the response and the quantitative predictors: The residuals should not display a clear pattern. For this reason, it is recommended to use component residual plots (e.g., crPlot() from the car library) for the visual identification of potentially non-linear trends.

  • No heteroscedasticity (i.e., non-constant variance of error terms): Visually, a violation of this assumption becomes apparent if the residuals form a funnel-like shape. It is also possible to conduct a non-constant variance test ncvTest(): If it returns \(p\)-values < 0.05, this suggests non-constant variance.

  • No multicollinearity: Predictors should not be correlated with each other. In the model data, correlated variables have unusually high standard errors, thereby decreasing the explanatory power of both the coefficients and the model as a whole. Another diagnostic measure is the variance inflation factor (VIF); predictors with VIF scores > 5 are potentially collinear. VIF scores can be computed using the vif() function.

  • Normally distributed residuals: The residuals should follow the normal distribution. Usually, a visual inspection using qqnorm() is sufficient, but the Shapiro-Wilk test shapiro.test() can also be run on the model residuals. Note that a \(p\)-value below 0.05 provides evidence for non-normality.
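
The checks above can be run on rt.lm2 as follows; a sketch, assuming the car library is installed and the model from above is in the workspace:

```r
library("car")

# Linearity: component + residual plots for the numeric predictors
crPlots(rt.lm2)

# Heteroscedasticity: non-constant variance test (p < 0.05 suggests a violation)
ncvTest(rt.lm2)

# Multicollinearity: (generalised) variance inflation factors
vif(rt.lm2)

# Normality of residuals: Q-Q plot and Shapiro-Wilk test
qqnorm(residuals(rt.lm2)); qqline(residuals(rt.lm2))
shapiro.test(residuals(rt.lm2))
```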

Important

Besides the points mentioned above, it is always recommended to examine the model with regard to

  • outliers that might skew the regression estimates,

  • interactions, i.e., combined effects of predictors, and

  • overfitting, which results in poor model performance outside the training data.